Zillow Home Value Index Analysis with PDFM Embeddings

Comparing Linear, Ridge, and Lasso Regression Models

Author

Zhanchao Yang

Introduction

This document demonstrates the application of Population Dynamics Foundation Model (PDFM) embeddings to predict Zillow Home Value Index (ZHVI) data. PDFM embeddings are 330-dimensional vector representations that capture complex spatial and demographic patterns.

Objectives

  1. Load and visualize Zillow Home Value Index data with geospatial mapping
  2. Join PDFM embeddings with home value data
  3. Build regression models to predict home values:
    • Linear Regression (baseline)
    • Ridge Regression (L2 regularization)
    • Lasso Regression (L1 regularization with feature selection)
  4. Evaluate and visualize model performance

Why Ridge and Lasso Regression?

Ridge and Lasso regression are particularly useful when working with high-dimensional embeddings (here, 330 features), where ordinary least squares is prone to overfitting and unstable coefficient estimates:

  • Ridge Regression (L2):
    • Penalizes large coefficients to prevent overfitting
    • Keeps all features but shrinks their impact
    • Performs well when many features contribute to the outcome
  • Lasso Regression (L1):
    • Performs automatic feature selection by shrinking some coefficients to zero
    • Identifies the most important embedding dimensions
    • Produces sparse models that are easier to interpret
  • Both:
    • Use cross-validation to choose the regularization parameter (lambda)
    • Are far more computationally efficient than stepwise regression
    • Tend to generalize better to unseen data
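To make the shrinkage effect concrete, here is a toy base-R sketch (a standalone illustration, not part of the analysis pipeline) of the ridge closed-form solution beta = (X'X + lambda*I)^(-1) X'y: as lambda grows, the coefficient vector shrinks toward zero.

```r
set.seed(1)

# Toy design matrix: 50 observations, 5 features, two of them irrelevant
n <- 50; p <- 5
X <- matrix(rnorm(n * p), n, p)
y <- X %*% c(3, -2, 1, 0, 0) + rnorm(n)

# Ridge closed-form solution: (X'X + lambda I)^{-1} X'y
ridge_coef <- function(X, y, lambda) {
  solve(crossprod(X) + lambda * diag(ncol(X)), crossprod(X, y))
}

b_ols   <- ridge_coef(X, y, 0)    # lambda = 0 reduces to ordinary least squares
b_ridge <- ridge_coef(X, y, 100)  # heavy penalty shrinks all coefficients

# The squared L2 norm of the coefficients shrinks as lambda increases
sum(b_ridge^2) < sum(b_ols^2)  # TRUE
```

glmnet solves the same problem (plus standardization and an intercept) far more efficiently over a whole path of lambda values, which is why it is used below.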

Setup and Data Loading

Load Required Libraries

Code
# Data manipulation and analysis
library(tidyverse)
library(readr)

# Geospatial data handling
library(sf)
library(leaflet)

# Machine learning and modeling
library(caret)
library(MASS)  # For stepwise regression
library(glmnet)  # For Ridge and Lasso regression

# Model evaluation
library(Metrics)
library(ggplot2)

# For table formatting
library(knitr)
library(kableExtra)

# Set random seed for reproducibility
set.seed(42)

Download Zillow Home Value Index Data

Code
# Download ZHVI data
zhvi <- read.csv("https://github.com/opengeos/datasets/releases/download/us/zillow_home_value_index_by_county.csv")

Load and Prepare ZHVI Data

Code
# Construct state and county FIPS codes with the leading zeros
# that were dropped when the CSV parsed them as numbers
zhvi_df <- zhvi %>%
  mutate(
    StateCodeFIP = str_pad(as.character(StateCodeFIPS), width = 2, side = "left", pad = "0"),
    MunicipalCodeFIP = str_pad(as.character(MunicipalCodeFIPS), width = 3, side = "left", pad = "0")
  )
# Create place identifier
zhvi_df <- zhvi_df %>%
  mutate(
    place = paste0("geoId/", StateCodeFIP, MunicipalCodeFIP)
  )

Note: The place column creates a unique identifier for each county by combining state and municipal FIPS codes, which will be used to join with geospatial and embedding data.
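As a quick sanity check of this identifier scheme, here is a standalone base-R example using Los Angeles County (state FIPS 06, county FIPS 037); sprintf achieves the same zero-padding as str_pad:

```r
# Los Angeles County: state FIPS 6, county FIPS 37 (stored as numbers,
# so the leading zeros are lost and must be restored)
state_fips <- 6
county_fips <- 37

# Zero-pad to 2 and 3 digits, then build the place identifier
place <- sprintf("geoId/%02d%03d", state_fips, county_fips)
place  # "geoId/06037"
```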

Geospatial Data Integration

Load County Geometries

Code
county_gdf <- st_read("https://github.com/zyang91/Google-Embedding-tutorial/releases/download/v2.0.0/county.geojson", quiet = TRUE)

# Keep only the place identifier; attribute columns come from the joined tables
county_gdf <- county_gdf %>%
  dplyr::select(place)

Join ZHVI with County Geometries

Code
zhvi_county_gdf <- county_gdf %>%
  inner_join(
    zhvi_df,
    by = c("place" = "place")
  )

Visualizing Home Values

Prepare Data for Visualization

Code
# Select a specific date column for visualization
# (read.csv turns the "2024-10-31" header into the syntactic name "X2024.10.31")
target_date <- "X2024.10.31"
viz_gdf <- zhvi_county_gdf %>%
  dplyr::select(RegionName, State, all_of(target_date), geometry)

Create 2D Choropleth Map

Code
# Create interactive map with Leaflet
pal <- colorNumeric(
  palette = "Blues",
  domain = viz_gdf[[target_date]],
  na.color = "transparent"
)

leaflet(viz_gdf) %>%
  addTiles() %>%
  addPolygons(
    fillColor = ~pal(get(target_date)),
    fillOpacity = 0.7,
    color = "white",
    weight = 1,
    popup = ~paste0(
      "<strong>", RegionName, ", ", State, "</strong><br>",
      "Home Value: $", format(get(target_date), big.mark = ",")
    )
  ) %>%
  addLegend(
    position = "bottomright",
    pal = pal,
    values = ~get(target_date),
    title = "Zillow Home Median Value",
    opacity = 1
  )

Note: This creates an interactive map where users can click a county to see its home value in a popup. The blue color gradient represents the magnitude of home values.

PDFM Embeddings Integration

Load PDFM Embeddings

Code
# Load pre-computed PDFM embeddings
embeddings <- read_csv("https://github.com/zyang91/Google-Embedding-tutorial/releases/download/v2.0.0/county_embeddings.csv")

About PDFM Embeddings: These 330-dimensional vectors encode complex spatial patterns including:

  • Population mobility patterns
  • Search behavior trends
  • Local economic activity indicators
  • Environmental conditions
  • Demographic characteristics

Visualize Single Embedding Feature

Code
# Join embeddings with county geometries
df_embed <- county_gdf %>%
  inner_join(embeddings,
    by = "place"
  )

# Select one embedding feature to visualize
feature_col <- "feature329"
viz_embed <- df_embed %>%
  dplyr::select(state, all_of(feature_col), geometry)
Code
# Create map
pal_embed <- colorNumeric(
  palette = "Blues",
  domain = viz_embed[[feature_col]],
  na.color = "transparent"
)

leaflet(viz_embed) %>%
  addTiles() %>%
  addPolygons(
    fillColor = ~pal_embed(get(feature_col)),
    fillOpacity = 0.7,
    color = "white",
    weight = 1,
    popup = ~paste0(
      "<strong>", state, "</strong><br>",
      feature_col, ": ", round(get(feature_col), 4)
    )
  ) %>%
  addLegend(
    position = "bottomright",
    pal = pal_embed,
    values = ~get(feature_col),
    title = feature_col,
    opacity = 1
  )

Note: Each of the 330 embedding features captures different spatial patterns. Feature329 is visualized here as an example.

Regression Modeling

Prepare Training Data

Code
# Join ZHVI with embeddings
data <- zhvi_df %>%
  inner_join(
    embeddings,
    by = "place"
  )

# Define embedding features and target variable
embedding_features <- paste0("feature", 0:329)
target_label <- "X2024.10.31"

# Remove rows with missing target values
data <- data %>%
  filter(!is.na(get(target_label)))

# Select only features and target for modeling
modeling_data <- data %>%
  dplyr::select(all_of(c(embedding_features, target_label)))
modeling_data <- modeling_data %>%
  mutate(index = row_number())
# Split into training and testing sets (80/20 split)
train_indices <- createDataPartition(modeling_data$index, p = 0.8, list = FALSE)
train_data <- modeling_data[train_indices, ]
test_data <- modeling_data[-train_indices, ]

Training set: 2431 rows; test set: 604 rows.

Model 1: Linear Regression (Baseline)

Code
# refine train data
train_data <- train_data %>%
  dplyr::select(-index)

train_data <- train_data %>%
  rename(target = X2024.10.31)

# Fit linear regression model using all features
lr_model <- lm(target ~ ., data = train_data)

summary(lr_model)

Call:
lm(formula = target ~ ., data = train_data)

Residuals:
    Min      1Q  Median      3Q     Max 
-376020  -29025    -988   27176  758839 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  2.923e+05  3.417e+04   8.554  < 2e-16 ***
feature0    -6.484e+04  4.664e+04  -1.390 0.164591    
feature1    -1.890e+04  5.206e+03  -3.629 0.000291 ***
feature2    -9.262e+03  6.099e+03  -1.519 0.129003    
feature3    -1.944e+02  5.940e+03  -0.033 0.973901    
feature4    -1.000e+03  5.509e+03  -0.182 0.855949    
feature5    -3.751e+03  4.974e+03  -0.754 0.450865    
feature6    -5.508e+03  5.543e+03  -0.994 0.320564    
feature7    -1.079e+04  5.613e+03  -1.922 0.054740 .  
feature8    -2.761e+03  5.373e+03  -0.514 0.607335    
feature9    -1.680e+04  5.227e+03  -3.215 0.001326 ** 
feature10    1.849e+03  5.412e+03   0.342 0.732617    
feature11    2.269e+03  5.422e+03   0.418 0.675678    
feature12    2.725e+04  5.481e+03   4.971 7.19e-07 ***
feature13    2.693e+04  5.759e+03   4.677 3.09e-06 ***
feature14    2.102e+03  5.674e+03   0.370 0.711086    
feature15    2.586e+04  5.688e+03   4.545 5.80e-06 ***
feature16    1.292e+04  6.260e+03   2.064 0.039172 *  
feature17   -1.871e+04  4.968e+03  -3.765 0.000171 ***
feature18   -1.067e+04  5.602e+03  -1.905 0.056909 .  
feature19   -2.534e+04  5.272e+03  -4.806 1.65e-06 ***
feature20   -6.973e+03  4.941e+03  -1.411 0.158381    
feature21   -1.663e+04  5.008e+03  -3.321 0.000912 ***
feature22    9.916e+03  5.768e+03   1.719 0.085748 .  
feature23    3.453e+04  5.701e+03   6.056 1.65e-09 ***
feature24   -1.408e+04  4.706e+04  -0.299 0.764823    
feature25    2.313e+03  5.249e+03   0.441 0.659473    
feature26   -7.118e+02  5.816e+03  -0.122 0.902602    
feature27    1.906e+03  5.959e+03   0.320 0.749089    
feature28    1.033e+04  6.071e+03   1.701 0.089122 .  
feature29    1.481e+04  5.775e+03   2.564 0.010431 *  
feature30   -6.455e+03  5.632e+03  -1.146 0.251873    
feature31   -3.879e+02  5.539e+03  -0.070 0.944175    
feature32    6.302e+03  5.052e+03   1.247 0.212370    
feature33    3.139e+04  5.408e+03   5.803 7.49e-09 ***
feature34   -3.321e+03  5.220e+03  -0.636 0.524703    
feature35    2.918e+04  6.084e+03   4.797 1.73e-06 ***
feature36    1.710e+03  5.249e+03   0.326 0.744653    
feature37   -5.855e+03  5.742e+03  -1.020 0.308033    
feature38   -1.128e+04  5.782e+03  -1.951 0.051186 .  
feature39   -2.087e+03  4.609e+03  -0.453 0.650734    
feature40   -1.478e+04  4.830e+03  -3.060 0.002242 ** 
feature41   -2.229e+04  5.793e+03  -3.848 0.000123 ***
feature42   -9.822e+03  5.209e+03  -1.886 0.059497 .  
feature43   -3.839e+03  5.325e+03  -0.721 0.471002    
feature44    2.298e+04  5.672e+03   4.052 5.26e-05 ***
feature45   -1.500e+04  5.647e+03  -2.656 0.007962 ** 
feature46   -9.565e+02  5.391e+03  -0.177 0.859182    
feature47    9.350e+03  5.738e+03   1.630 0.103354    
feature48   -2.983e+03  5.205e+03  -0.573 0.566709    
feature49   -4.090e+03  5.177e+03  -0.790 0.429584    
feature50    1.425e+04  5.577e+03   2.555 0.010682 *  
feature51    5.759e+03  5.474e+03   1.052 0.292947    
feature52   -1.329e+04  5.300e+03  -2.508 0.012231 *  
feature53    1.104e+04  5.498e+03   2.008 0.044747 *  
feature54   -4.584e+03  5.339e+03  -0.859 0.390698    
feature55   -1.421e+03  5.824e+03  -0.244 0.807308    
feature56   -9.396e+03  5.481e+03  -1.714 0.086634 .  
feature57   -3.887e+04  5.274e+03  -7.371 2.42e-13 ***
feature58   -6.546e+03  5.264e+03  -1.244 0.213803    
feature59   -1.552e+04  5.285e+03  -2.937 0.003347 ** 
feature60    1.546e+04  5.352e+03   2.888 0.003912 ** 
feature61    1.253e+04  5.852e+03   2.140 0.032442 *  
feature62    7.411e+02  5.660e+03   0.131 0.895836    
feature63    3.691e+03  5.993e+03   0.616 0.538064    
feature64    2.213e+03  5.341e+03   0.414 0.678596    
feature65   -2.288e+04  5.608e+03  -4.080 4.67e-05 ***
feature66   -1.797e+04  5.094e+03  -3.528 0.000428 ***
feature67    1.293e+04  5.624e+03   2.299 0.021606 *  
feature68   -1.094e+04  5.291e+03  -2.068 0.038740 *  
feature69    5.633e+03  5.360e+03   1.051 0.293468    
feature70    3.634e+03  5.024e+03   0.723 0.469615    
feature71   -2.033e+04  5.520e+03  -3.683 0.000236 ***
feature72    1.429e+03  5.157e+03   0.277 0.781714    
feature73   -1.893e+04  5.822e+03  -3.252 0.001166 ** 
feature74   -1.495e+04  5.001e+03  -2.989 0.002830 ** 
feature75   -3.786e+03  5.072e+03  -0.747 0.455404    
feature76    1.450e+05  4.749e+04   3.053 0.002293 ** 
feature77    4.332e+04  5.854e+03   7.399 1.97e-13 ***
feature78    1.631e+04  5.466e+03   2.984 0.002877 ** 
feature79    2.559e+04  5.962e+03   4.292 1.85e-05 ***
feature80   -1.601e+03  5.732e+03  -0.279 0.780022    
feature81    3.645e+04  5.842e+03   6.240 5.28e-10 ***
feature82    3.046e+04  5.481e+03   5.558 3.08e-08 ***
feature83   -4.520e+03  5.843e+03  -0.774 0.439251    
feature84   -1.994e+04  5.488e+03  -3.634 0.000286 ***
feature85    2.423e+04  5.692e+03   4.257 2.17e-05 ***
feature86    1.020e+04  6.270e+03   1.626 0.104003    
feature87   -3.117e+04  5.528e+03  -5.638 1.95e-08 ***
feature88   -8.138e+03  5.295e+03  -1.537 0.124499    
feature89   -1.740e+04  4.782e+03  -3.638 0.000281 ***
feature90   -3.430e+03  6.131e+03  -0.559 0.575896    
feature91    3.688e+04  5.434e+03   6.788 1.47e-11 ***
feature92   -1.167e+04  5.197e+03  -2.246 0.024787 *  
feature93   -3.999e+03  5.586e+03  -0.716 0.474151    
feature94    3.826e+04  5.701e+03   6.712 2.46e-11 ***
feature95   -3.883e+04  6.359e+03  -6.107 1.21e-09 ***
feature96   -1.535e+04  5.856e+03  -2.621 0.008823 ** 
feature97    2.035e+03  5.889e+03   0.346 0.729740    
feature98   -1.053e+04  5.896e+03  -1.787 0.074162 .  
feature99    7.540e+03  5.979e+03   1.261 0.207426    
feature100  -1.756e+04  5.719e+03  -3.070 0.002167 ** 
feature101   1.548e+04  5.441e+03   2.846 0.004473 ** 
feature102   1.352e+04  5.656e+03   2.390 0.016946 *  
feature103   2.439e+04  5.725e+03   4.260 2.14e-05 ***
feature104  -2.295e+04  5.329e+03  -4.306 1.74e-05 ***
feature105  -1.771e+03  5.465e+03  -0.324 0.745878    
feature106  -7.883e+03  5.229e+03  -1.508 0.131805    
feature107  -1.385e+04  5.449e+03  -2.542 0.011097 *  
feature108   4.519e+04  6.781e+03   6.664 3.39e-11 ***
feature109  -1.057e+04  5.632e+03  -1.877 0.060636 .  
feature110   4.495e+03  5.690e+03   0.790 0.429661    
feature111   2.867e+04  5.947e+03   4.821 1.53e-06 ***
feature112   5.693e+03  5.667e+03   1.005 0.315212    
feature113  -4.713e+03  5.597e+03  -0.842 0.399782    
feature114   4.099e+03  5.926e+03   0.692 0.489179    
feature115  -4.339e+03  5.502e+03  -0.789 0.430451    
feature116   6.155e+03  5.765e+03   1.068 0.285803    
feature117  -5.910e+03  4.294e+03  -1.376 0.168814    
feature118   1.899e+03  5.149e+03   0.369 0.712390    
feature119  -3.195e+03  5.244e+03  -0.609 0.542345    
feature120  -1.044e+03  5.787e+03  -0.180 0.856806    
feature121  -6.598e+03  5.689e+03  -1.160 0.246281    
feature122  -3.457e+04  6.160e+03  -5.613 2.25e-08 ***
feature123  -4.675e+03  5.486e+03  -0.852 0.394148    
feature124  -2.770e+04  5.094e+03  -5.437 6.06e-08 ***
feature125   3.368e+04  6.147e+03   5.478 4.81e-08 ***
feature126   1.053e+04  5.869e+03   1.793 0.073063 .  
feature127   2.075e+04  5.508e+03   3.767 0.000170 ***
feature128  -2.126e+01  4.490e+03  -0.005 0.996223    
feature129   1.579e+04  1.184e+04   1.333 0.182739    
feature130  -5.363e+03  4.978e+03  -1.077 0.281466    
feature131  -3.811e+03  3.341e+03  -1.141 0.254195    
feature132  -1.194e+04  8.623e+03  -1.385 0.166323    
feature133   5.748e+03  5.745e+03   1.000 0.317201    
feature134  -2.098e+03  7.242e+04  -0.029 0.976895    
feature135   7.097e+03  1.288e+04   0.551 0.581683    
feature136  -2.899e+03  1.341e+04  -0.216 0.828872    
feature137  -8.275e+04  7.399e+04  -1.118 0.263526    
feature138   1.743e+05  6.397e+04   2.725 0.006479 ** 
feature139   5.676e+04  6.966e+04   0.815 0.415254    
feature140   4.231e+04  1.256e+04   3.369 0.000768 ***
feature141  -9.457e+03  1.035e+04  -0.913 0.361210    
feature142   5.172e+03  6.141e+03   0.842 0.399777    
feature143  -9.761e+03  9.786e+03  -0.997 0.318652    
feature144   5.040e+04  1.237e+04   4.073 4.81e-05 ***
feature145  -1.329e+05  5.913e+04  -2.247 0.024725 *  
feature146   1.159e+04  1.070e+04   1.082 0.279169    
feature147  -9.822e+02  4.805e+03  -0.204 0.838074    
feature148   1.240e+04  7.181e+03   1.727 0.084338 .  
feature149   4.676e+02  5.855e+03   0.080 0.936350    
feature150   6.264e+04  5.951e+04   1.053 0.292622    
feature151   4.345e+01  3.146e+03   0.014 0.988980    
feature152  -8.972e+04  6.556e+04  -1.368 0.171318    
feature153  -4.021e+04  1.546e+04  -2.601 0.009357 ** 
feature154  -2.750e+02  1.146e+04  -0.024 0.980862    
feature155   9.481e+03  6.966e+03   1.361 0.173658    
feature156   2.138e+03  7.523e+03   0.284 0.776314    
feature157   1.342e+04  1.204e+04   1.114 0.265277    
feature158  -9.103e+08  3.537e+09  -0.257 0.796893    
feature159  -2.136e+04  1.403e+04  -1.522 0.128133    
feature160  -3.596e+03  3.839e+03  -0.937 0.348997    
feature161  -7.898e+01  6.365e+03  -0.012 0.990102    
feature162  -1.023e+04  1.032e+04  -0.991 0.321875    
feature163   5.567e+04  5.363e+04   1.038 0.299349    
feature164  -7.599e+03  3.745e+03  -2.029 0.042607 *  
feature165  -2.290e+04  7.699e+03  -2.974 0.002972 ** 
feature166   9.422e+02  5.764e+04   0.016 0.986959    
feature167   1.304e+04  5.708e+04   0.228 0.819362    
feature168   1.001e+05  6.673e+04   1.500 0.133863    
feature169   2.110e+05  6.573e+04   3.210 0.001348 ** 
feature170   1.004e+05  5.518e+04   1.820 0.068971 .  
feature171  -1.818e+04  1.071e+04  -1.698 0.089747 .  
feature172  -1.067e+05  6.661e+04  -1.601 0.109450    
feature173  -1.254e+04  5.471e+03  -2.292 0.021981 *  
feature174  -6.760e+03  6.325e+03  -1.069 0.285276    
feature175   1.317e+05  7.704e+04   1.710 0.087441 .  
feature176  -2.983e+03  4.591e+03  -0.650 0.516003    
feature177   1.624e+03  2.954e+03   0.550 0.582632    
feature178   1.114e+04  1.027e+04   1.085 0.278185    
feature179   1.347e+03  7.355e+03   0.183 0.854706    
feature180  -3.947e+03  9.036e+03  -0.437 0.662332    
feature181  -2.386e+04  6.382e+03  -3.739 0.000190 ***
feature182  -4.717e+03  8.251e+03  -0.572 0.567602    
feature183   1.562e+03  3.317e+03   0.471 0.637671    
feature184   1.546e+05  6.194e+04   2.496 0.012626 *  
feature185  -7.612e+02  5.275e+03  -0.144 0.885278    
feature186   5.186e+03  5.944e+03   0.873 0.382994    
feature187   1.887e+04  9.275e+03   2.035 0.041977 *  
feature188  -2.257e+03  5.644e+03  -0.400 0.689314    
feature189   2.148e+03  5.713e+03   0.376 0.707012    
feature190  -6.345e+03  1.668e+04  -0.380 0.703668    
feature191  -9.327e+04  5.730e+04  -1.628 0.103732    
feature192   9.891e+04  8.122e+04   1.218 0.223458    
feature193   7.396e+03  6.077e+03   1.217 0.223749    
feature194  -1.573e+05  5.810e+04  -2.708 0.006824 ** 
feature195   6.023e+03  5.212e+03   1.156 0.247937    
feature196   2.299e+05  7.168e+04   3.208 0.001358 ** 
feature197   4.481e+04  5.742e+04   0.780 0.435231    
feature198  -1.499e+02  9.767e+03  -0.015 0.987756    
feature199  -7.827e+04  5.342e+04  -1.465 0.143052    
feature200   3.919e+03  6.045e+03   0.648 0.516842    
feature201  -1.371e+04  1.810e+04  -0.757 0.448931    
feature202   1.300e+04  1.412e+04   0.921 0.357398    
feature203  -1.620e+04  5.272e+04  -0.307 0.758642    
feature204  -1.700e+03  7.334e+03  -0.232 0.816778    
feature205  -2.272e+04  5.860e+04  -0.388 0.698301    
feature206   9.101e+04  6.805e+04   1.337 0.181262    
feature207   5.414e+04  8.522e+03   6.352 2.59e-10 ***
feature208   1.615e+04  6.543e+04   0.247 0.805118    
feature209  -4.616e+04  1.335e+04  -3.458 0.000555 ***
feature210  -6.910e+03  9.694e+03  -0.713 0.476057    
feature211   2.386e+03  1.503e+04   0.159 0.873891    
feature212  -3.547e+04  1.527e+04  -2.323 0.020285 *  
feature213  -1.713e+05  5.824e+04  -2.941 0.003312 ** 
feature214  -3.778e+03  5.138e+03  -0.735 0.462259    
feature215  -1.127e+03  5.238e+03  -0.215 0.829712    
feature216  -3.234e+04  6.384e+04  -0.507 0.612515    
feature217  -9.149e+03  4.123e+03  -2.219 0.026602 *  
feature218  -2.571e+04  1.028e+04  -2.501 0.012444 *  
feature219   8.461e+03  1.403e+04   0.603 0.546586    
feature220   3.763e+03  1.013e+04   0.371 0.710341    
feature221  -1.667e+04  2.381e+04  -0.700 0.483959    
feature222   2.652e+04  1.241e+04   2.136 0.032775 *  
feature223   7.303e+04  6.008e+04   1.215 0.224340    
feature224  -7.310e+04  8.271e+04  -0.884 0.376864    
feature225  -3.368e+04  1.483e+04  -2.272 0.023210 *  
feature226  -1.314e+05  5.742e+04  -2.289 0.022204 *  
feature227  -8.502e+03  1.225e+04  -0.694 0.487563    
feature228  -9.595e+03  7.501e+03  -1.279 0.200999    
feature229   8.698e+03  8.375e+04   0.104 0.917285    
feature230   8.051e+04  1.557e+04   5.172 2.53e-07 ***
feature231   1.781e+04  1.614e+04   1.103 0.270096    
feature232   1.110e+02  6.524e+03   0.017 0.986421    
feature233  -2.238e+04  2.118e+04  -1.056 0.290870    
feature234   2.052e+04  1.825e+04   1.124 0.261094    
feature235  -1.374e+04  4.907e+03  -2.799 0.005166 ** 
feature236  -2.629e+03  9.922e+03  -0.265 0.791029    
feature237  -4.339e+03  4.707e+03  -0.922 0.356798    
feature238   3.852e+03  6.369e+04   0.060 0.951786    
feature239  -1.894e+04  1.028e+04  -1.843 0.065541 .  
feature240  -8.501e+03  5.459e+03  -1.557 0.119549    
feature241   5.802e+03  6.435e+03   0.902 0.367386    
feature242  -6.039e+03  7.393e+03  -0.817 0.414139    
feature243   3.022e+04  6.481e+04   0.466 0.641033    
feature244   2.110e+04  1.107e+04   1.906 0.056781 .  
feature245   3.703e+04  5.322e+04   0.696 0.486665    
feature246  -6.135e+03  8.861e+03  -0.692 0.488770    
feature247  -1.546e+05  6.139e+04  -2.518 0.011892 *  
feature248   4.183e+03  5.065e+03   0.826 0.408944    
feature249   1.015e+04  1.532e+04   0.663 0.507507    
feature250   3.404e+04  1.347e+04   2.527 0.011570 *  
feature251  -4.872e+03  9.242e+03  -0.527 0.598160    
feature252  -3.918e+04  1.377e+04  -2.845 0.004491 ** 
feature253   1.164e+05  6.168e+04   1.888 0.059203 .  
feature254  -1.729e+05  7.181e+04  -2.407 0.016154 *  
feature255  -9.907e+03  6.626e+04  -0.150 0.881151    
feature256   1.004e+03  2.544e+03   0.394 0.693312    
feature257   3.717e+03  3.254e+03   1.142 0.253439    
feature258  -2.176e+03  3.304e+03  -0.659 0.510255    
feature259   1.809e+03  3.847e+03   0.470 0.638314    
feature260   7.726e+02  3.490e+03   0.221 0.824835    
feature261  -1.868e+03  2.728e+03  -0.685 0.493494    
feature262  -8.123e+03  3.373e+03  -2.408 0.016105 *  
feature263   8.999e+03  3.941e+03   2.283 0.022506 *  
feature264  -6.545e+03  3.539e+03  -1.850 0.064503 .  
feature265   3.401e+03  3.532e+03   0.963 0.335796    
feature266  -8.592e+03  3.202e+03  -2.684 0.007340 ** 
feature267   5.477e+04  5.483e+04   0.999 0.317957    
feature268   9.102e+02  3.822e+03   0.238 0.811766    
feature269  -9.589e+03  3.967e+03  -2.417 0.015731 *  
feature270   4.323e+03  3.908e+03   1.106 0.268770    
feature271   1.410e+03  2.833e+03   0.498 0.618650    
feature272   3.846e+03  3.838e+03   1.002 0.316448    
feature273   1.772e+03  3.687e+03   0.481 0.630913    
feature274   2.348e+03  3.554e+03   0.661 0.508825    
feature275  -3.518e+03  3.437e+03  -1.024 0.306149    
feature276   9.059e+02  3.231e+03   0.280 0.779242    
feature277   5.254e+03  2.862e+03   1.836 0.066550 .  
feature278   3.607e+03  3.234e+03   1.115 0.264834    
feature279  -6.503e+03  3.908e+03  -1.664 0.096272 .  
feature280   2.292e+03  2.957e+03   0.775 0.438402    
feature281  -4.644e+03  3.702e+03  -1.255 0.209744    
feature282  -2.065e+03  3.490e+03  -0.592 0.554038    
feature283   2.242e+03  3.732e+03   0.601 0.548041    
feature284   4.647e+03  3.343e+03   1.390 0.164673    
feature285   3.772e+02  3.212e+03   0.117 0.906533    
feature286   3.963e+03  3.169e+03   1.251 0.211194    
feature287  -2.063e+03  1.536e+03  -1.343 0.179493    
feature288  -6.668e+03  2.911e+03  -2.290 0.022104 *  
feature289  -1.099e+04  2.774e+03  -3.963 7.66e-05 ***
feature290  -5.420e+03  2.889e+03  -1.876 0.060838 .  
feature291   8.285e+03  3.339e+03   2.481 0.013182 *  
feature292   4.481e+03  4.070e+03   1.101 0.270970    
feature293  -5.847e+03  3.282e+03  -1.782 0.074932 .  
feature294  -4.980e+03  3.438e+03  -1.449 0.147595    
feature295   4.927e+03  3.097e+03   1.591 0.111744    
feature296  -8.711e+03  2.512e+03  -3.467 0.000537 ***
feature297  -5.638e+03  3.892e+03  -1.449 0.147603    
feature298   1.434e+03  3.150e+03   0.455 0.648981    
feature299  -5.802e+03  3.719e+03  -1.560 0.118883    
feature300  -5.712e+03  3.124e+03  -1.829 0.067613 .  
feature301  -2.137e+03  2.539e+03  -0.842 0.400011    
feature302   5.517e+03  2.826e+03   1.952 0.051058 .  
feature303  -1.182e+04  4.186e+03  -2.823 0.004805 ** 
feature304   1.170e+04  3.105e+03   3.769 0.000169 ***
feature305   5.272e+03  3.143e+03   1.678 0.093569 .  
feature306   4.727e+03  4.769e+03   0.991 0.321739    
feature307  -7.177e+03  2.558e+03  -2.806 0.005070 ** 
feature308   9.804e+03  3.746e+03   2.617 0.008936 ** 
feature309   1.822e+04  3.847e+03   4.736 2.32e-06 ***
feature310   2.236e+02  2.331e+03   0.096 0.923611    
feature311   1.456e+02  3.988e+03   0.037 0.970887    
feature312  -5.665e+03  3.490e+03  -1.623 0.104743    
feature313  -2.280e+03  3.457e+03  -0.660 0.509605    
feature314  -1.620e+03  2.798e+03  -0.579 0.562686    
feature315  -7.906e+01  2.455e+03  -0.032 0.974313    
feature316  -6.722e+03  3.591e+03  -1.872 0.061374 .  
feature317  -9.344e+03  3.307e+03  -2.826 0.004760 ** 
feature318  -9.038e+02  3.091e+03  -0.292 0.770022    
feature319  -1.135e+03  3.687e+03  -0.308 0.758185    
feature320  -1.835e+02  3.665e+03  -0.050 0.960071    
feature321  -8.836e+00  3.030e+03  -0.003 0.997673    
feature322   5.219e+03  2.947e+03   1.771 0.076683 .  
feature323  -6.288e+02  3.378e+03  -0.186 0.852329    
feature324   1.704e+03  3.356e+03   0.508 0.611776    
feature325  -5.609e+03  3.899e+03  -1.438 0.150457    
feature326   1.778e+03  1.073e+03   1.658 0.097559 .  
feature327  -2.830e+03  2.988e+03  -0.947 0.343671    
feature328  -5.539e+04  5.781e+04  -0.958 0.338057    
feature329  -3.916e+03  2.409e+03  -1.626 0.104145    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 62540 on 2100 degrees of freedom
Multiple R-squared:  0.8829,    Adjusted R-squared:  0.8645 
F-statistic:    48 on 330 and 2100 DF,  p-value: < 2.2e-16

On the training data, the adjusted R-squared is 0.8645, indicating the model explains a high proportion of the variance in home values; test-set metrics below show how well this holds up on unseen counties.
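This value can be reproduced from the summary output above with the adjusted R-squared formula, 1 - (1 - R²)(n - 1)/(n - p - 1), using the 2100 residual degrees of freedom reported there:

```r
r2 <- 0.8829  # multiple R-squared from the summary
n  <- 2431    # training observations
p  <- 330     # embedding features (n - p - 1 = 2100 residual df)

# Adjusted R-squared penalizes R-squared for the number of predictors
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)
round(adj_r2, 4)  # 0.8645
```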

Code
# Make predictions on test set
y_pred_lr <- predict(lr_model, newdata = test_data %>% dplyr::select(-index, -X2024.10.31))

y_test <- test_data$X2024.10.31

# Calculate evaluation metrics
lr_mae <- mae(y_test, y_pred_lr)
lr_rmse <- rmse(y_test, y_pred_lr)
lr_r2 <- cor(y_test, y_pred_lr)^2
# Display results
lr_results <- data.frame(
  Metric = c("MAE", "RMSE", "R²"),
  Value = c(
    round(lr_mae, 2),
    round(lr_rmse, 2),
    round(lr_r2, 4)
  )
)
kable(lr_results,
      caption = "Linear Regression Performance Metrics",
      col.names = c("Metric", "Value")) %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
Linear Regression Performance Metrics
Metric Value
MAE 42972.12
RMSE 59355.71
R² 0.8565

Linear Regression Interpretation:

  • Uses all 330 embedding features
  • No feature selection - may include irrelevant features
  • Serves as baseline for comparison

Model 2: Stepwise Regression

Code
# Perform stepwise regression using AIC criterion
# Direction "both" allows both forward and backward selection
# step_model <- stepAIC(
#   lr_model,
#   direction = "both",
#   trace = TRUE  # Set to TRUE to see step-by-step selection
# )

Stepwise Regression Interpretation:

  • Automatically selects most predictive features using AIC (Akaike Information Criterion)
  • Balances model fit with complexity
  • Typically results in a more parsimonious model
  • Shows which embedding dimensions are most important for prediction
  • Not run here: with 330 candidate features, the stepwise search would fit over 54,000 models and take roughly 6-7 days to finish
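The model count is a simple back-of-envelope estimate: backward elimination refits up to k candidate models when k features remain, so a full sweep from 330 features down to none is bounded by 330 + 329 + … + 1 fits.

```r
p <- 330  # number of embedding features

# Upper bound on model fits for a full backward-elimination sweep:
# up to k refits at each step while k features remain
max_models <- sum(1:p)
max_models  # 54615
```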

Model 3: Ridge Regression

Code
# Prepare matrix format (glmnet requires matrix input rather than a data frame)
x_train <- as.matrix(train_data %>% dplyr::select(-target))
y_train <- train_data$target
x_test <- as.matrix(test_data %>% dplyr::select(-index, -X2024.10.31))

# Perform cross-validation to find optimal lambda
cv_ridge <- cv.glmnet(x_train, y_train, alpha = 0, nfolds = 10)

# Fit Ridge regression model with optimal lambda
ridge_model <- glmnet(x_train, y_train, alpha = 0, lambda = cv_ridge$lambda.min)

# Make predictions on test set
y_pred_ridge <- predict(ridge_model, newx = x_test, s = cv_ridge$lambda.min)

# Calculate evaluation metrics
ridge_mae <- mae(y_test, y_pred_ridge)
ridge_rmse <- rmse(y_test, y_pred_ridge)
ridge_r2 <- cor(y_test, y_pred_ridge)^2

# Display results
ridge_results <- data.frame(
  Metric = c("MAE", "RMSE", "R²", "Optimal Lambda"),
  Value = c(
    round(ridge_mae, 2),
    round(ridge_rmse, 2),
    round(ridge_r2, 4),
    round(cv_ridge$lambda.min, 4)
  )
)

kable(ridge_results,
      caption = "Ridge Regression Performance Metrics",
      col.names = c("Metric", "Value")) %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
Ridge Regression Performance Metrics
Metric Value
MAE 40895.10
RMSE 57471.69
R² 0.8606
Optimal Lambda 14210.2992

Ridge Regression Interpretation:

  • Uses L2 regularization to penalize large coefficients
  • Shrinks coefficients but keeps all 330 features (does not perform feature selection)
  • Lambda parameter controls the amount of regularization
  • Cross-validation used to find optimal lambda value
  • Helps prevent overfitting compared to standard linear regression

Model 4: Lasso Regression

Code
# Perform cross-validation to find optimal lambda for Lasso
cv_lasso <- cv.glmnet(x_train, y_train, alpha = 1, nfolds = 10)

# Fit Lasso regression model with optimal lambda
lasso_model <- glmnet(x_train, y_train, alpha = 1, lambda = cv_lasso$lambda.min)

# Make predictions on test set
y_pred_lasso <- predict(lasso_model, newx = x_test, s = cv_lasso$lambda.min)

# Calculate evaluation metrics
lasso_mae <- mae(y_test, y_pred_lasso)
lasso_rmse <- rmse(y_test, y_pred_lasso)
lasso_r2 <- cor(y_test, y_pred_lasso)^2

# Count number of non-zero coefficients (features selected)
lasso_coefs <- coef(lasso_model, s = cv_lasso$lambda.min)
n_features_selected <- sum(lasso_coefs != 0) - 1  # Subtract 1 for intercept

# Display results
lasso_results <- data.frame(
  Metric = c("MAE", "RMSE", "R²", "Optimal Lambda", "Features Selected"),
  Value = c(
    round(lasso_mae, 2),
    round(lasso_rmse, 2),
    round(lasso_r2, 4),
    round(cv_lasso$lambda.min, 4),
    n_features_selected
  )
)

kable(lasso_results,
      caption = "Lasso Regression Performance Metrics",
      col.names = c("Metric", "Value")) %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
Lasso Regression Performance Metrics
Metric Value
MAE 41944.95
RMSE 58234.79
R² 0.8581
Optimal Lambda 587.1713
Features Selected 229

Lasso Regression Interpretation:

  • Uses L1 regularization to penalize large coefficients
  • Performs automatic feature selection by shrinking some coefficients to exactly zero
  • Lambda parameter controls the amount of regularization
  • Cross-validation used to find optimal lambda value
  • Results in a sparse model with fewer features than Ridge regression
  • Helps identify the most important embedding dimensions for prediction
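The mechanism behind Lasso's exact zeros is the soft-thresholding operator S(z, λ) = sign(z) · max(|z| − λ, 0), which coordinate-descent solvers such as glmnet apply coefficient by coefficient; a minimal base-R sketch:

```r
# Soft-thresholding: shrinks z toward zero, and sets it exactly to zero
# when |z| <= lambda -- this is how Lasso drops features entirely
soft_threshold <- function(z, lambda) {
  sign(z) * pmax(abs(z) - lambda, 0)
}

soft_threshold(c(-3, -0.5, 0.2, 2), lambda = 1)
# -2  0  0  1
```

Ridge's L2 penalty, by contrast, rescales coefficients toward zero but never truncates them exactly, which is why it retains all 330 features.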

Model Comparison

Code
# Create comparison dataframe with all models
model_comparison <- data.frame(
  Model = c("Linear Regression", "Ridge Regression", "Lasso Regression"),
  MAE = c(
    round(lr_mae, 2),
    round(ridge_mae, 2),
    round(lasso_mae, 2)
  ),
  RMSE = c(
    round(lr_rmse, 2),
    round(ridge_rmse, 2),
    round(lasso_rmse, 2)
  ),
  R_squared = c(
    round(lr_r2, 4),
    round(ridge_r2, 4),
    round(lasso_r2, 4)
  ),
  Features_Used = c(
    330,
    330,
    n_features_selected
  )
)

kable(model_comparison,
      caption = "Comparison of Regression Models",
      col.names = c("Model", "MAE", "RMSE", "R²", "Features Used")) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed")) %>%
  row_spec(which.min(model_comparison$RMSE), bold = TRUE, color = "white", background = "#4CAF50")
Comparison of Regression Models
Model MAE RMSE R² Features Used
Linear Regression 42972.12 59355.71 0.8565 330
Ridge Regression 40895.10 57471.69 0.8606 330
Lasso Regression 41944.95 58234.79 0.8581 229

Key Insights:

  • Compare which model performs better on unseen data
  • Ridge regression uses all features but with regularization to prevent overfitting
  • Lasso regression performs feature selection, using fewer features while maintaining accuracy
  • Feature reduction can lead to better interpretability and faster predictions
  • The best performing model (lowest RMSE) is highlighted in green
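The gains in the table are modest but consistent; working out the relative RMSE improvement over the unregularized baseline (values taken directly from the comparison table above):

```r
# RMSE values from the model comparison table
rmse_linear <- 59355.71
rmse_ridge  <- 57471.69
rmse_lasso  <- 58234.79

# Relative improvement over the linear baseline, in percent
pct_ridge <- 100 * (rmse_linear - rmse_ridge) / rmse_linear
pct_lasso <- 100 * (rmse_linear - rmse_lasso) / rmse_linear

cat(sprintf("Ridge reduces RMSE by %.1f%%; Lasso by %.1f%% while using 101 fewer features\n",
            pct_ridge, pct_lasso))
```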

Visualization of Model Performance

Actual vs. Predicted Plot - Linear Regression

Code
# Create evaluation dataframe for linear regression
eval_df_lr <- data.frame(
  actual = y_test,
  predicted = y_pred_lr
)

# Plot
ggplot(eval_df_lr, aes(x = actual, y = predicted)) +
  geom_point(alpha = 0.5, color = "steelblue") +
  geom_abline(intercept = 0, slope = 1, color = "red", linetype = "dashed") +
  coord_fixed(xlim = c(0, 1000000), ylim = c(0, 1000000)) +
  labs(
    title = "Linear Regression: Actual vs Predicted",
    subtitle = paste0("R² = ", round(lr_r2, 4)),
    x = "Actual Home Value ($)",
    y = "Predicted Home Value ($)"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(hjust = 0.5, face = "bold"),
    plot.subtitle = element_text(hjust = 0.5)
  ) +
  scale_x_continuous(labels = scales::comma) +
  scale_y_continuous(labels = scales::comma)

Actual vs. Predicted Plot - Ridge Regression

Code
# Create evaluation dataframe for Ridge regression
eval_df_ridge <- data.frame(
  actual = y_test,
  predicted = as.vector(y_pred_ridge)
)

# Plot
ggplot(eval_df_ridge, aes(x = actual, y = predicted)) +
  geom_point(alpha = 0.5, color = "darkgreen") +
  geom_abline(intercept = 0, slope = 1, color = "red", linetype = "dashed") +
  coord_fixed(xlim = c(0, 1000000), ylim = c(0, 1000000)) +
  labs(
    title = "Ridge Regression: Actual vs Predicted",
    subtitle = paste0("R² = ", round(ridge_r2, 4)),
    x = "Actual Home Value ($)",
    y = "Predicted Home Value ($)"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(hjust = 0.5, face = "bold"),
    plot.subtitle = element_text(hjust = 0.5)
  ) +
  scale_x_continuous(labels = scales::comma) +
  scale_y_continuous(labels = scales::comma)

Interpretation of Scatter Plots:

  • Points along the red diagonal line indicate perfect predictions
  • Points above the line = model overestimates home value
  • Points below the line = model underestimates home value
  • Tighter clustering around the diagonal = better model performance
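The over/under split can be quantified by counting points on each side of the diagonal. A sketch with simulated stand-ins for the `actual`/`predicted` vectors (a real check would use `eval_df_lr` directly):

```r
# Simulated stand-in for y_test and an unbiased model's predictions
set.seed(1)
actual    <- runif(500, 100000, 900000)
predicted <- actual + rnorm(500, mean = 0, sd = 50000)

# Points above the diagonal are overestimates; below are underestimates
n_over  <- sum(predicted > actual)
n_under <- sum(predicted < actual)

# A roughly 50/50 split suggests no systematic bias; a lopsided split
# would indicate the model skews high or low
cat(sprintf("Overestimates: %d (%.0f%%), underestimates: %d (%.0f%%)\n",
            n_over, 100 * n_over / 500, n_under, 100 * n_under / 500))
```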

Actual vs. Predicted Plot - Lasso Regression

Code
# Create evaluation dataframe for Lasso regression
eval_df_lasso <- data.frame(
  actual = y_test,
  predicted = as.vector(y_pred_lasso)
)

# Plot
ggplot(eval_df_lasso, aes(x = actual, y = predicted)) +
  geom_point(alpha = 0.5, color = "darkorange") +
  geom_abline(intercept = 0, slope = 1, color = "red", linetype = "dashed") +
  coord_fixed(xlim = c(0, 1000000), ylim = c(0, 1000000)) +
  labs(
    title = "Lasso Regression: Actual vs Predicted",
    subtitle = paste0("R² = ", round(lasso_r2, 4), " | Features: ", n_features_selected, "/330"),
    x = "Actual Home Value ($)",
    y = "Predicted Home Value ($)"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(hjust = 0.5, face = "bold"),
    plot.subtitle = element_text(hjust = 0.5)
  ) +
  scale_x_continuous(labels = scales::comma) +
  scale_y_continuous(labels = scales::comma)

Spatial Visualization of Prediction Errors

Calculate Prediction Differences

Code
# Recover the place identifiers for the test split: test rows are those
# not sampled into train_indices, in the same order as y_test
# (assumes modeling_data rows are aligned with the original data)
test_places <- data$place[-train_indices]

# Create comparison dataframe
prediction_comparison <- data.frame(
  place = test_places,
  actual = y_test,
  pred_lr = y_pred_lr,
  pred_ridge = as.vector(y_pred_ridge),
  pred_lasso = as.vector(y_pred_lasso)
) %>%
  mutate(
    diff_lr = pred_lr - actual,
    diff_ridge = pred_ridge - actual,
    diff_lasso = pred_lasso - actual
  )

Map Prediction Errors

Linear Regression Errors:

Code
# Join prediction differences with county geometries
error_map_data_lr <- county_gdf %>%
  inner_join(prediction_comparison, by = "place")
# Create color palette for differences (blue = underestimate, red = overestimate)
pal_diff_lr <- colorNumeric(
  palette = c("blue", "white", "red"),
  domain = c(-200000, 200000),
  na.color = "gray"
)
# Create interactive map showing Linear regression errors
leaflet(error_map_data_lr) %>%
  addTiles() %>%
  addPolygons(
    fillColor = ~pal_diff_lr(diff_lr),
    fillOpacity = 0.7,
    color = "white",
    weight = 1,
    popup = ~paste0(
      "<strong>County</strong><br>",
      "Actual: $", format(round(actual), big.mark = ","), "<br>",
      "Predicted (Linear): $", format(round(pred_lr), big.mark = ","), "<br>",
      "Difference: $", format(round(diff_lr), big.mark = ",")
    )
  ) %>%
  addLegend(
    position = "bottomright",
    pal = pal_diff_lr,
    values = ~diff_lr,
    title = "Prediction Error<br>(Linear Model)",
    opacity = 1,
    labFormat = labelFormat(prefix = "$")
  )

Ridge Regression Errors:

Code
# Join prediction differences with county geometries
error_map_data_ridge <- county_gdf %>%
  inner_join(prediction_comparison, by = "place")
# Create color palette for differences (blue = underestimate, red = overestimate)
pal_diff_ridge <- colorNumeric(
  palette = c("blue", "white", "red"),
  domain = c(-200000, 200000),
  na.color = "gray"
)
# Create interactive map showing Ridge regression errors
leaflet(error_map_data_ridge) %>%
  addTiles() %>%
  addPolygons(
    fillColor = ~pal_diff_ridge(diff_ridge),
    fillOpacity = 0.7,
    color = "white",
    weight = 1,
    popup = ~paste0(
      "<strong>County</strong><br>",
      "Actual: $", format(round(actual), big.mark = ","), "<br>",
      "Predicted (Ridge): $", format(round(pred_ridge), big.mark = ","), "<br>",
      "Difference: $", format(round(diff_ridge), big.mark = ",")
    )
  ) %>%
  addLegend(
    position = "bottomright",
    pal = pal_diff_ridge,
    values = ~diff_ridge,
    title = "Prediction Error<br>(Ridge Model)",
    opacity = 1,
    labFormat = labelFormat(prefix = "$")
  )

Lasso Regression Errors:

Code
# Join prediction differences with county geometries
error_map_data <- county_gdf %>%
  inner_join(prediction_comparison, by = "place")

# Create color palette for differences (blue = underestimate, red = overestimate)
pal_diff <- colorNumeric(
  palette = c("blue", "white", "red"),
  domain = c(-200000, 200000),
  na.color = "gray"
)

# Create interactive map showing Lasso regression errors
leaflet(error_map_data) %>%
  addTiles() %>%
  addPolygons(
    fillColor = ~pal_diff(diff_lasso),
    fillOpacity = 0.7,
    color = "white",
    weight = 1,
    popup = ~paste0(
      "<strong>County</strong><br>",
      "Actual: $", format(round(actual), big.mark = ","), "<br>",
      "Predicted (Lasso): $", format(round(pred_lasso), big.mark = ","), "<br>",
      "Difference: $", format(round(diff_lasso), big.mark = ",")
    )
  ) %>%
  addLegend(
    position = "bottomright",
    pal = pal_diff,
    values = ~diff_lasso,
    title = "Prediction Error<br>(Lasso Model)",
    opacity = 1,
    labFormat = labelFormat(prefix = "$")
  )

Spatial Error Analysis:

  • Red areas: Model overestimates home values (predicted > actual)
  • Blue areas: Model underestimates home values (predicted < actual)
  • White areas: Predictions close to actual values
  • This spatial visualization can reveal geographic patterns in model performance
  • Systematic errors in specific regions may indicate missing spatial features or local market conditions not captured by embeddings
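One way to surface such systematic regional errors is to aggregate the signed differences by region: a global MAE/RMSE hides bias that a per-region mean error exposes. A toy sketch (the `state` column is hypothetical; a real version would join a region identifier onto `prediction_comparison`):

```r
library(dplyr)

# Toy prediction-error table with a hypothetical region column
errors <- data.frame(
  state = rep(c("PA", "NJ", "NY"), each = 4),
  diff  = c(  5000,  -3000,   8000,   2000,   # PA: slight overestimate
            -20000, -15000, -25000, -18000,   # NJ: systematic underestimate
              1000,  -2000,   3000,  -1000)   # NY: roughly unbiased
)

# Mean signed error per region; values far from zero flag systematic bias
regional_bias <- errors %>%
  group_by(state) %>%
  summarise(mean_error = mean(diff), .groups = "drop") %>%
  arrange(mean_error)

print(regional_bias)
```

Here NJ's large negative mean error would stand out immediately, pointing to a local market effect the embeddings do not capture.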

Summary and Conclusions

Key Findings

  1. PDFM Embeddings as Features: The 330-dimensional PDFM embeddings successfully capture spatial patterns relevant to home value prediction.

  2. Model Performance:

    • Linear regression provides a straightforward baseline using all features
    • Ridge regression uses L2 regularization to reduce overfitting while keeping all features
    • Lasso regression performs automatic feature selection through L1 regularization
  3. Regularization Benefits:

    • Reduced overfitting compared to standard linear regression
    • Ridge: Shrinks coefficients but maintains all 330 features
    • Lasso: Identifies most important embedding dimensions through feature selection
    • Cross-validation ensures optimal regularization strength (lambda)
    • Improved generalization to unseen data
  4. Spatial Patterns: Error maps reveal geographic variations in prediction accuracy, suggesting opportunities for:

    • Regional model calibration
    • Incorporation of additional local features
    • Investigation of market-specific factors

Methodological Considerations

Advantages of Using Embeddings:

  • Captures complex, non-linear relationships
  • Incorporates diverse data sources (mobility, search trends, environment)
  • Transfer learning from large-scale models
  • Rich spatial representation

Limitations:

  • Black-box nature makes interpretation challenging
  • Embeddings may encode biases from training data

Future Directions

  1. Advanced Models: Try ensemble methods (Random Forest, XGBoost) or neural networks
  2. Feature Engineering: Combine embeddings with traditional features (square footage, bedrooms, etc.)
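Combining embeddings with traditional features amounts to column-binding the two feature matrices before refitting the regularized model. A minimal sketch with simulated data (all names and the two conventional covariates are illustrative; glmnet standardizes columns by default, so the mixed scales are handled):

```r
library(glmnet)

set.seed(7)
n <- 300

# Stand-ins: a 330-dim embedding matrix plus two conventional covariates
embeddings <- matrix(rnorm(n * 330), nrow = n,
                     dimnames = list(NULL, paste0("emb_", 1:330)))
traditional <- cbind(median_sqft = rnorm(n, 1800, 400),
                     median_age  = rnorm(n, 40, 12))
y <- 100 * traditional[, "median_sqft"] + 50000 * embeddings[, 1] +
  rnorm(n, sd = 20000)

# Column-bind the feature sets and refit the cross-validated Lasso
X_combined <- cbind(embeddings, traditional)
cv_fit <- cv.glmnet(X_combined, y, alpha = 1)

cat("Columns in combined design matrix:", ncol(X_combined), "\n")
```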

References